Motivation for this separate notebook: PCA does an ok job parsing out the different DDM variable types but EFA doesn’t. Initially I thought theoretically EFA would be more approrpriate since I am trying to capture ‘latent variables’ and didn’t want to assume that the variables don’t have any unique variance. I was particularly reluctant to do this since the same task correlations for all variable types were higher than correlations across tasks.

My initial attempt to run EFA on EZ variables from test data gave some errors, the fit wasn’t particularly good and the variable types did not really cluster separately. There was a cluster that included all the drift rates and another that included the majority of non-decision times. One could make an argument on using these clusters only as they capture separable processes but seems weak considering the bad fit of the model.

My earlier efforts (pre Neuroecon) used PCA. These did a better job in clustering the different variable types. This isn’t too surprising given that PCA will assume all variability within a measure is due to common variance. One could build an argument suggesting that PCA could/should be used in this case because even if the clusters are not necessarily reflecting common cognitive processes they method serves the purpose of reducing the dimensionality of rich data. This too seems weak if it’s not going to add anything to our understanding of the cognitive processes and will remain a primarily stastistical exercise.

More rencetly I am confused by other papers’ use of PCA. E.g. the Econographics paper suggests: > To summarize these clusters, we make use of principal components analysis (PCA), a statistical technique that produces components—linear combinations of variables—that explain as much variation in the underlying behaviors as possible, as discussed in Section 3. These components highlight latent dimensions underlying the econographic variables.

All of this is obviously very post-hoc. How should I go about this decision?

library(tidyverse)
theme_set(theme_bw())
options(scipen = 1, digits = 4)

helper_func_path = '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/code/helper_functions/'
source(paste0(helper_func_path, 'get_numeric_cols.R'))
source(paste0(helper_func_path, 'remove_outliers.R'))
source(paste0(helper_func_path, 'remove_correlated_task_variables.R'))
source(paste0(helper_func_path, 'transform_remove_skew.R'))
source(paste0(helper_func_path, 'get_demographics.R'))
source(paste0(helper_func_path, 'residualize_baseline.R'))
source(paste0(helper_func_path, 'find_optimal_components.R'))

ddm_workspace_scripts = '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/'
source(paste0(ddm_workspace_scripts,'ddm_measure_labels.R'))
source(paste0(ddm_workspace_scripts,'ddm_subject_data.R'))
rm(test_data_hddm_fullfit, test_data_hddm_refit)
clean_test_data = remove_correlated_task_variables(test_data)
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Dropping 137 variables with correlations above 0.85
motor_selective_stop_signal.hddm_drift
adaptive_n_back.std_rt
adaptive_n_back.avg_rt_error
adaptive_n_back.std_rt_error
attention_network_task.EZ_drift_incongruent
attention_network_task.hddm_drift
attention_network_task.conflict_acc
attention_network_task.EZ_non_decision_congruent
attention_network_task.EZ_non_decision_neutral
attention_network_task.EZ_thresh_incongruent
attention_network_task.conflict_acc
attention_network_task.congruent_rt
attention_network_task.incongruent_rt
attention_network_task.neutral_rt
attention_network_task.hddm_drift_incongruent
attention_network_task.neutral_rt
choice_reaction_time.hddm_drift
directed_forgetting.EZ_drift_pos
directed_forgetting.acc
directed_forgetting.hddm_drift
directed_forgetting.acc
directed_forgetting.hddm_drift
directed_forgetting.EZ_non_decision_pos
directed_forgetting.avg_rt
dot_pattern_expectancy.avg_rt
dot_pattern_expectancy.hddm_drift_AY
dot_pattern_expectancy.hddm_drift
dot_pattern_expectancy.EZ_non_decision_AX
local_global_letter.EZ_drift_global
local_global_letter.EZ_drift_switch
local_global_letter.acc
local_global_letter.hddm_drift
local_global_letter.hddm_drift
local_global_letter.hddm_drift
local_global_letter.EZ_non_decision_congruent
local_global_letter.EZ_non_decision_global
local_global_letter.EZ_non_decision_local
local_global_letter.EZ_non_decision_neutral
local_global_letter.EZ_non_decision_stay
local_global_letter.EZ_non_decision_switch
local_global_letter.avg_rt
local_global_letter.congruent_rt
local_global_letter.EZ_non_decision_local
local_global_letter.EZ_non_decision_stay
local_global_letter.EZ_non_decision_switch
local_global_letter.congruent_rt
local_global_letter.EZ_non_decision_neutral
local_global_letter.EZ_non_decision_switch
local_global_letter.avg_rt
local_global_letter.incongruent_rt
local_global_letter.local_incongruent_rt
local_global_letter.EZ_non_decision_stay
local_global_letter.EZ_non_decision_switch
local_global_letter.EZ_non_decision_switch
local_global_letter.avg_rt
local_global_letter.congruent_rt
local_global_letter.EZ_thresh_global
local_global_letter.EZ_thresh_incongruent
local_global_letter.EZ_thresh_stay
local_global_letter.EZ_thresh_switch
local_global_letter.congruent_rt
local_global_letter.global_congruent_rt
local_global_letter.incongruent_rt
local_global_letter.local_congruent_rt
local_global_letter.conflict_acc
local_global_letter.incongruent_harm_acc
local_global_letter.hddm_drift_incongruent
local_global_letter.global_congruent_rt
local_global_letter.local_congruent_rt
local_global_letter.global_bias_acc
local_global_letter.incongruent_rt
local_global_letter.local_incongruent_rt
local_global_letter.switch_cost_acc
recent_probes.EZ_drift_xrec
recent_probes.hddm_drift
recent_probes.EZ_drift_xrec_pos
recent_probes.hddm_drift
recent_probes.EZ_non_decision_xrec
recent_probes.EZ_non_decision_xrec_neg
shape_matching.hddm_drift
simon.EZ_drift_incongruent
simon.hddm_drift
simon.hddm_drift
simon.EZ_non_decision_congruent
simon.EZ_non_decision_incongruent
simon.EZ_thresh_incongruent
simon.congruent_sd_rt
simon.incongruent_acc
simon.congruent_avg_rt
simon.incongruent_avg_rt
simon.incongruent_avg_rt
simon.std_rt
simon.simon_acc
simon.std_rt
stim_selective_stop_signal.ignore_rt_error
stim_selective_stop_signal.stop_rt_error
stim_selective_stop_signal.stop_rt_error
motor_selective_stop_signal.hddm_drift
stim_selective_stop_signal.ignore_rt_error
stim_selective_stop_signal.stop_rt_error
stim_selective_stop_signal.stop_rt_error
stop_signal.hddm_drift
stop_signal.SSRT_high
stop_signal.SSRT_low
stop_signal.stop_rt_error
stop_signal.total_errors
stroop.EZ_drift_incongruent
stroop.hddm_drift
stroop.hddm_drift
stroop.EZ_non_decision_congruent
stroop.EZ_non_decision_incongruent
stroop.avg_rt
stroop.congruent_rt
stroop.congruent_rt
stroop.avg_rt
stroop.incongruent_rt
stroop.EZ_thresh_incongruent
stroop.congruent_rt
stroop.incongruent_rt
stroop.incongruent_rt
stroop.incongruent_errors
threebytwo.EZ_drift_task_switch_100.0
threebytwo.EZ_drift_task_switch_900.0
threebytwo.acc
threebytwo.hddm_drift
threebytwo.acc
threebytwo.hddm_drift
threebytwo.hddm_drift
threebytwo.EZ_non_decision_cue_switch_100.0
threebytwo.EZ_non_decision_task_switch_100.0
threebytwo.EZ_non_decision_task_switch_900.0
threebytwo.EZ_non_decision_task_switch_100.0
threebytwo.avg_rt
threebytwo.EZ_thresh_task_switch_100.0
threebytwo.EZ_thresh_task_switch_900.0
threebytwo.hddm_drift
threebytwo.avg_rt_error
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Remove outliers (>2.5 SD away)

clean_test_data = cbind(sub_id = clean_test_data$sub_id, as.data.frame(apply(clean_test_data[, -which(names(clean_test_data) %in% c("sub_id"))], 2, remove_outliers)))

Transform skewed variables

numeric_cols = get_numeric_cols()
numeric_cols = numeric_cols[numeric_cols %in% names(clean_test_data) == T]
clean_test_data = transform_remove_skew(clean_test_data, numeric_cols)
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
18 data positively skewed data were transformed:
adaptive_n_back.missed_percent
attention_network_task.avg_rt_error
attention_network_task.missed_percent
attention_network_task.std_rt_error
directed_forgetting.missed_percent
dot_pattern_expectancy.AX_errors
dot_pattern_expectancy.AY_errors
dot_pattern_expectancy.BX_errors
dot_pattern_expectancy.hddm_thresh
dot_pattern_expectancy.missed_percent
local_global_letter.global_congruent_errors
local_global_letter.local_congruent_errors
local_global_letter.missed_percent
shape_matching.missed_percent
simon.std_rt_error
stim_selective_stop_signal.hddm_thresh
stroop.missed_percent
threebytwo.missed_percent
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
13 positively skewed data could not be transformed successfully:
adaptive_n_back.missed_percent
attention_network_task.avg_rt_error
attention_network_task.missed_percent
attention_network_task.std_rt_error
directed_forgetting.missed_percent
dot_pattern_expectancy.missed_percent
local_global_letter.global_congruent_errors
local_global_letter.local_congruent_errors
local_global_letter.missed_percent
shape_matching.missed_percent
simon.std_rt_error
stroop.missed_percent
threebytwo.missed_percent
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
8 data negatively skewed data were transformed:
attention_network_task.acc
choice_reaction_time.acc
dot_pattern_expectancy.acc
motor_selective_stop_signal.ignore_acc
shape_matching.acc
simon.congruent_acc
stim_selective_stop_signal.go_acc
stim_selective_stop_signal.ignore_acc
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
6 negatively skewed data could not be transformed successfully:
choice_reaction_time.acc
dot_pattern_expectancy.acc
motor_selective_stop_signal.ignore_acc
shape_matching.acc
simon.congruent_acc
stim_selective_stop_signal.go_acc
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Drop subject identifier column, mean impute and drop cols with no variance

clean_test_data_std = clean_test_data %>% mutate_if(is.numeric, scale)
clean_test_data_std = clean_test_data_std %>% select(-sub_id)

#mean imputation
clean_test_data_std[is.na(clean_test_data_std)]=0

#drop cols with no variance
clean_test_data_std = clean_test_data_std %>%
  select_if(function(col) sd(col) != 0)

Extract EZ and HDDM variables

clean_test_data_ez = clean_test_data_std %>%
  select(grep('EZ', names(clean_test_data_std), value=T))

clean_test_data_hddm = clean_test_data_std %>%
  select(grep('hddm', names(clean_test_data_std), value=T))

Residualize Age and Sex effects

data_path = '/Users/zeynepenkavi/Documents/PoldrackLabLocal/Self_Regulation_Ontology/Data/'
release = 'Complete_03-29-2018/'
dataset = 'demographic_health.csv'

demographics = get_demographics(dataset = paste0(data_path, release, dataset))

demographics = demographics %>%
  filter(X %in% clean_test_data$sub_id)
clean_test_data_ez = cbind(clean_test_data_ez, demographics[,c("Age", "Sex")])

res_clean_test_data_ez = residualize_baseline(clean_test_data_ez)
Warning: Unknown variables: `X`, `sub_id`, `subj_id`
res_clean_test_data_ez

PCA on EZ variables of test data

pca_ez_t1_comp_metrics = find_optimal_components(res_clean_test_data_ez, minc=2, maxc=20, model = "PCA")
pca_ez_t1_comp_metrics
ez_t1_pca_3 = principal(res_clean_test_data_ez, nfactors=3, rotate="oblimin")
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was
done
The determinant of the smoothed correlation was zero.
This means the objective function is not defined for the null model either.
The Chi square is thus based upon observed correlations.
Warning in principal(res_clean_test_data_ez, nfactors = 3, rotate =
"oblimin"): The matrix is not positive semi-definite, scores found from
Structure loadings
data.frame(ez_t1_pca_3$values) %>%
  rename(eig = ez_t1_pca_3.values) %>%
  arrange(-eig) %>%
  mutate(var_pct = eig/sum(eig)*100,
         pc = 1:n()) %>%
  filter(pc<11)%>%
  ggplot(aes(factor(pc), var_pct))+
  geom_bar(stat="identity")+
  ylab("Percentage of variance explained")+
  xlab("Principal component")

data.frame(ez_t1_pca_3$values) %>%
  rename(eig = ez_t1_pca_3.values) %>%
  arrange(-eig) %>%
  mutate(var_pct = eig/sum(eig)*100,
         pc = 1:n(),
         var_pct_shift = lead(var_pct),
         var_pct_diff = var_pct - var_pct_shift) %>%
  filter(pc<11) %>%
  arrange(-var_pct_diff)
ez_t1_pca_3_loadings = as.data.frame(ez_t1_pca_3$loadings[])

ez_t1_pca_3_loadings[abs(ez_t1_pca_3_loadings)<0.3]=NA

tmp = ez_t1_pca_3_loadings %>%
  mutate(dv = row.names(.)) %>%
  select(dv, TC1, TC2, TC3) %>%
  mutate(num_loading = 3-(is.na(TC1)+is.na(TC2)+is.na(TC3) ) ) %>%
  filter(num_loading!=0) %>%
  select(-num_loading) %>%
  arrange(-TC1, -TC2, -TC3) %>%
  mutate(order_num = 1:n(),
         dv = reorder(dv, -order_num)) %>%
  select(-order_num) %>%
  gather(Factor, Loading, -dv) %>%
  na.exclude() %>%
  mutate(load_sign = factor(ifelse(Loading>0,"pos","neg")))

var_type = ifelse(grepl("drift", tmp$dv), "#7fc97f",
                  ifelse(grepl("thresh", tmp$dv), "#beaed4",
                         ifelse(grepl("non_dec", tmp$dv), "#fdc086", NA)))         
tmp%>%         
  ggplot(aes(dv, abs(Loading), fill=load_sign))+
  geom_bar(stat = "identity")+
  facet_wrap(~Factor, nrow=1)+
  coord_flip()+
  xlab("")+
  theme(legend.position = "none",
        axis.text.y = element_text(color = var_type))+
  ylab("Absolute Loading")

EFA on EZ variables of test data

efa_ez_t1_comp_metrics = find_optimal_components(res_clean_test_data_ez, fm = "minres", minc=2)
efa_ez_t1_comp_metrics
ez_t1_fa_8 = fa(res_clean_test_data_ez, efa_ez_t1_comp_metrics$comp[1], rotate='oblimin', fm='minres', scores='tenBerge')
Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
done

Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
done

Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
done
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was
done
The determinant of the smoothed correlation was zero.
This means the objective function is not defined for the null model either.
The Chi square is thus based upon observed correlations.
The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was
done
ez_t1_fa_3 = fa(res_clean_test_data_ez, 3, rotate='oblimin', fm='minres', scores='tenBerge')
Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
done

Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
done

Warning in cor.smooth(R): Matrix was not positive definite, smoothing was
done
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was
done
The determinant of the smoothed correlation was zero.
This means the objective function is not defined for the null model either.
The Chi square is thus based upon observed correlations.
The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
Warning in cor.smooth(r): Matrix was not positive definite, smoothing was
done
anova(ez_t1_fa_3, ez_t1_fa_8)
ez_t1_fa_3_loadings = as.data.frame(ez_t1_fa_3$loadings[])

ez_t1_fa_3_loadings[abs(ez_t1_fa_3_loadings)<0.3]=NA

tmp = ez_t1_fa_3_loadings %>%
  mutate(dv = row.names(.)) %>%
  select(dv, MR1, MR2, MR3) %>%
  mutate(num_loading = 3-(is.na(MR1)+is.na(MR2)+is.na(MR3) ) ) %>%
  filter(num_loading!=0) %>%
  select(-num_loading) %>%
  arrange(-MR1, -MR2, -MR3) %>%
  mutate(order_num = 1:n(),
         dv = reorder(dv, -order_num)) %>%
  select(-order_num) %>%
  gather(Factor, Loading, -dv) %>%
  na.exclude() %>%
  mutate(load_sign = factor(ifelse(Loading>0,"pos","neg")))

var_type = ifelse(grepl("drift", tmp$dv), "#7fc97f",
                  ifelse(grepl("thresh", tmp$dv), "#beaed4",
                         ifelse(grepl("non_dec", tmp$dv), "#fdc086", NA)))         
tmp%>%         
  ggplot(aes(dv, abs(Loading), fill=load_sign))+
  geom_bar(stat = "identity")+
  facet_wrap(~Factor, nrow=1)+
  coord_flip()+
  xlab("")+
  theme(legend.position = "none",
        axis.text.y = element_text(color = var_type))+
  ylab("Absolute Loading")

ez_t1_fa_8_loadings = as.data.frame(ez_t1_fa_8$loadings[])

ez_t1_fa_8_loadings[abs(ez_t1_fa_8_loadings)<0.3]=NA

tmp = ez_t1_fa_8_loadings %>%
  mutate(dv = row.names(.)) %>%
  select(dv, MR1, MR2, MR3, MR4, MR5, MR6, MR7, MR8) %>%
  mutate(num_loading = 8-(is.na(MR1)+is.na(MR2)+is.na(MR3)+is.na(MR4)+is.na(MR5)+is.na(MR6)+is.na(MR7)+is.na(MR8) ) ) %>%
  filter(num_loading!=0) %>%
  select(-num_loading) %>%
  arrange(-MR1, -MR2, -MR3,-MR1, -MR5, -MR6,-MR7, -MR8) %>%
  mutate(order_num = 1:n(),
         dv = reorder(dv, -order_num)) %>%
  select(-order_num) %>%
  gather(Factor, Loading, -dv) %>%
  na.exclude() %>%
  mutate(load_sign = factor(ifelse(Loading>0,"pos","neg")))

var_type = ifelse(grepl("drift", tmp$dv), "#7fc97f",
                  ifelse(grepl("thresh", tmp$dv), "#beaed4",
                         ifelse(grepl("non_dec", tmp$dv), "#fdc086", NA)))         
tmp%>%         
  ggplot(aes(dv, abs(Loading), fill=load_sign))+
  geom_bar(stat = "identity")+
  facet_wrap(~Factor, nrow=1)+
  coord_flip()+
  xlab("")+
  theme(legend.position = "none",
        axis.text.y = element_text(color = var_type))+
  ylab("Absolute Loading")

PCA on HDDM variables of test data

EFA on HDDM variables of test data

PCA on EZ variables of retest data

EFA on EZ variables of retest data

PCA on HDDM variables of retest data

EFA on HDDM variables of retest data

---
title: 'Different approaches to dimensionality reduction of DDM variables'
output:
github_document:
toc: yes
toc_float: yes
---

Motivation for this separate notebook: PCA does an ok job parsing out the different DDM variable types but EFA doesn't. Initially I thought theoretically EFA would be more approrpriate since I am trying to capture 'latent variables' and didn't want to assume that the variables don't have any unique variance. I was particularly reluctant to do this since the same task correlations for all variable types were higher than correlations across tasks.

My initial attempt to run EFA on EZ variables from test data gave some errors, the fit wasn't particularly good and the variable types did not really cluster separately. There was a cluster that included all the drift rates and another that included the majority of non-decision times. One could make an argument on using these clusters only as they capture separable processes but seems weak considering the bad fit of the model.

My earlier efforts (pre Neuroecon) used PCA. These did a better job in clustering the different variable types. This isn't too surprising given that PCA will assume all variability within a measure is due to common variance. One could build an argument suggesting that PCA could/should be used in this case because even if the clusters are not necessarily reflecting common cognitive processes they method serves the purpose of reducing the dimensionality of rich data. This too seems weak if it's not going to add anything to our understanding of the cognitive processes and will remain a primarily stastistical exercise.

More rencetly I am confused by other papers' use of PCA. E.g. the Econographics paper suggests:
> To summarize these clusters, we make use of principal components analysis (PCA), a statistical technique that produces components—linear combinations of variables—that explain as much variation in the underlying behaviors as possible, as discussed in Section 3. These components highlight latent dimensions underlying the econographic variables.

All of this is obviously very post-hoc. How should I go about this decision?

```{r}
library(tidyverse)
theme_set(theme_bw())
options(scipen = 1, digits = 4)

helper_func_path = '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/code/helper_functions/'
source(paste0(helper_func_path, 'get_numeric_cols.R'))
source(paste0(helper_func_path, 'remove_outliers.R'))
source(paste0(helper_func_path, 'remove_correlated_task_variables.R'))
source(paste0(helper_func_path, 'transform_remove_skew.R'))
source(paste0(helper_func_path, 'get_demographics.R'))
source(paste0(helper_func_path, 'residualize_baseline.R'))
source(paste0(helper_func_path, 'find_optimal_components.R'))

ddm_workspace_scripts = '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/'
source(paste0(ddm_workspace_scripts,'ddm_measure_labels.R'))
source(paste0(ddm_workspace_scripts,'ddm_subject_data.R'))
rm(test_data_hddm_fullfit, test_data_hddm_refit)
```

```{r}
clean_test_data = remove_correlated_task_variables(test_data)
```

Remove outliers (>2.5 SD away)

```{r}
clean_test_data = cbind(sub_id = clean_test_data$sub_id, as.data.frame(apply(clean_test_data[, -which(names(clean_test_data) %in% c("sub_id"))], 2, remove_outliers)))
```

Transform skewed variables

```{r}
numeric_cols = get_numeric_cols()
numeric_cols = numeric_cols[numeric_cols %in% names(clean_test_data) == T]
clean_test_data = transform_remove_skew(clean_test_data, numeric_cols)
```

Drop subject identifier column, mean impute and drop cols with no variance

```{r}
clean_test_data_std = clean_test_data %>% mutate_if(is.numeric, scale)
clean_test_data_std = clean_test_data_std %>% select(-sub_id)

#mean imputation
clean_test_data_std[is.na(clean_test_data_std)]=0

#drop cols with no variance
clean_test_data_std = clean_test_data_std %>%
  select_if(function(col) sd(col) != 0)
```

Extract EZ and HDDM variables

```{r}
clean_test_data_ez = clean_test_data_std %>%
  select(grep('EZ', names(clean_test_data_std), value=T))

clean_test_data_hddm = clean_test_data_std %>%
  select(grep('hddm', names(clean_test_data_std), value=T))
```

Residualize Age and Sex effects 

```{r}
data_path = '/Users/zeynepenkavi/Documents/PoldrackLabLocal/Self_Regulation_Ontology/Data/'
release = 'Complete_03-29-2018/'
dataset = 'demographic_health.csv'

demographics = get_demographics(dataset = paste0(data_path, release, dataset))

demographics = demographics %>%
  filter(X %in% clean_test_data$sub_id)
```

```{r}
clean_test_data_ez = cbind(clean_test_data_ez, demographics[,c("Age", "Sex")])

res_clean_test_data_ez = residualize_baseline(clean_test_data_ez)

res_clean_test_data_ez
```

## PCA on EZ variables of test data

```{r warning=FALSE, message=FALSE}
pca_ez_t1_comp_metrics = find_optimal_components(res_clean_test_data_ez, minc=2, maxc=20, model = "PCA")
```

```{r}
pca_ez_t1_comp_metrics
```

```{r}
ez_t1_pca_3 = principal(res_clean_test_data_ez, nfactors=3, rotate="oblimin")
```

```{r}
data.frame(ez_t1_pca_3$values) %>%
  rename(eig = ez_t1_pca_3.values) %>%
  arrange(-eig) %>%
  mutate(var_pct = eig/sum(eig)*100,
         pc = 1:n()) %>%
  filter(pc<11)%>%
  ggplot(aes(factor(pc), var_pct))+
  geom_bar(stat="identity")+
  ylab("Percentage of variance explained")+
  xlab("Principal component")
```

```{r}
data.frame(ez_t1_pca_3$values) %>%
  rename(eig = ez_t1_pca_3.values) %>%
  arrange(-eig) %>%
  mutate(var_pct = eig/sum(eig)*100,
         pc = 1:n(),
         var_pct_shift = lead(var_pct),
         var_pct_diff = var_pct - var_pct_shift) %>%
  filter(pc<11) %>%
  arrange(-var_pct_diff)
```

```{r}
ez_t1_pca_3_loadings = as.data.frame(ez_t1_pca_3$loadings[])

ez_t1_pca_3_loadings[abs(ez_t1_pca_3_loadings)<0.3]=NA

tmp = ez_t1_pca_3_loadings %>%
  mutate(dv = row.names(.)) %>%
  select(dv, TC1, TC2, TC3) %>%
  mutate(num_loading = 3-(is.na(TC1)+is.na(TC2)+is.na(TC3) ) ) %>%
  filter(num_loading!=0) %>%
  select(-num_loading) %>%
  arrange(-TC1, -TC2, -TC3) %>%
  mutate(order_num = 1:n(),
         dv = reorder(dv, -order_num)) %>%
  select(-order_num) %>%
  gather(Factor, Loading, -dv) %>%
  na.exclude() %>%
  mutate(load_sign = factor(ifelse(Loading>0,"pos","neg")))

var_type = ifelse(grepl("drift", tmp$dv), "#7fc97f",
                  ifelse(grepl("thresh", tmp$dv), "#beaed4",
                         ifelse(grepl("non_dec", tmp$dv), "#fdc086", NA)))         
tmp%>%         
  ggplot(aes(dv, abs(Loading), fill=load_sign))+
  geom_bar(stat = "identity")+
  facet_wrap(~Factor, nrow=1)+
  coord_flip()+
  xlab("")+
  theme(legend.position = "none",
        axis.text.y = element_text(color = var_type))+
  ylab("Absolute Loading")
```

## EFA on EZ variables of test data

```{r warning=FALSE, message=FALSE}
efa_ez_t1_comp_metrics = find_optimal_components(res_clean_test_data_ez, fm = "minres", minc=2)
```

```{r}
efa_ez_t1_comp_metrics
```

```{r}
ez_t1_fa_8 = fa(res_clean_test_data_ez, efa_ez_t1_comp_metrics$comp[1], rotate='oblimin', fm='minres', scores='tenBerge')
```

```{r}
ez_t1_fa_3 = fa(res_clean_test_data_ez, 3, rotate='oblimin', fm='minres', scores='tenBerge')
```

```{r}
anova(ez_t1_fa_3, ez_t1_fa_8)
```

```{r}
ez_t1_fa_3_loadings = as.data.frame(ez_t1_fa_3$loadings[])

ez_t1_fa_3_loadings[abs(ez_t1_fa_3_loadings)<0.3]=NA

tmp = ez_t1_fa_3_loadings %>%
  mutate(dv = row.names(.)) %>%
  select(dv, MR1, MR2, MR3) %>%
  mutate(num_loading = 3-(is.na(MR1)+is.na(MR2)+is.na(MR3) ) ) %>%
  filter(num_loading!=0) %>%
  select(-num_loading) %>%
  arrange(-MR1, -MR2, -MR3) %>%
  mutate(order_num = 1:n(),
         dv = reorder(dv, -order_num)) %>%
  select(-order_num) %>%
  gather(Factor, Loading, -dv) %>%
  na.exclude() %>%
  mutate(load_sign = factor(ifelse(Loading>0,"pos","neg")))

var_type = ifelse(grepl("drift", tmp$dv), "#7fc97f",
                  ifelse(grepl("thresh", tmp$dv), "#beaed4",
                         ifelse(grepl("non_dec", tmp$dv), "#fdc086", NA)))         
tmp%>%         
  ggplot(aes(dv, abs(Loading), fill=load_sign))+
  geom_bar(stat = "identity")+
  facet_wrap(~Factor, nrow=1)+
  coord_flip()+
  xlab("")+
  theme(legend.position = "none",
        axis.text.y = element_text(color = var_type))+
  ylab("Absolute Loading")
```

```{r}
ez_t1_fa_8_loadings = as.data.frame(ez_t1_fa_8$loadings[])

ez_t1_fa_8_loadings[abs(ez_t1_fa_8_loadings)<0.3]=NA

tmp = ez_t1_fa_8_loadings %>%
  mutate(dv = row.names(.)) %>%
  select(dv, MR1, MR2, MR3, MR4, MR5, MR6, MR7, MR8) %>%
  mutate(num_loading = 8-(is.na(MR1)+is.na(MR2)+is.na(MR3)+is.na(MR4)+is.na(MR5)+is.na(MR6)+is.na(MR7)+is.na(MR8) ) ) %>%
  filter(num_loading!=0) %>%
  select(-num_loading) %>%
  arrange(-MR1, -MR2, -MR3,-MR1, -MR5, -MR6,-MR7, -MR8) %>%
  mutate(order_num = 1:n(),
         dv = reorder(dv, -order_num)) %>%
  select(-order_num) %>%
  gather(Factor, Loading, -dv) %>%
  na.exclude() %>%
  mutate(load_sign = factor(ifelse(Loading>0,"pos","neg")))

var_type = ifelse(grepl("drift", tmp$dv), "#7fc97f",
                  ifelse(grepl("thresh", tmp$dv), "#beaed4",
                         ifelse(grepl("non_dec", tmp$dv), "#fdc086", NA)))         
tmp%>%         
  ggplot(aes(dv, abs(Loading), fill=load_sign))+
  geom_bar(stat = "identity")+
  facet_wrap(~Factor, nrow=1)+
  coord_flip()+
  xlab("")+
  theme(legend.position = "none",
        axis.text.y = element_text(color = var_type))+
  ylab("Absolute Loading")
```


## PCA on HDDM variables of test data

## EFA on HDDM variables of test data

## PCA on EZ variables of retest data

## EFA on EZ variables of retest data

## PCA on HDDM variables of retest data

## EFA on HDDM variables of retest data